Skip to content

[fix](json) Fix strict json validation#63309

Open
mrhhsg wants to merge 2 commits into
apache:masterfrom
mrhhsg:fix-json-strict-extract
Open

[fix](json) Fix strict json validation#63309
mrhhsg wants to merge 2 commits into
apache:masterfrom
mrhhsg:fix-json-strict-extract

Conversation

@mrhhsg
Copy link
Copy Markdown
Member

@mrhhsg mrhhsg commented May 16, 2026

What problem does this PR solve?

Related PR: None

Problem Summary: Fix json_valid accepting incomplete JSON input or JSON with trailing non-whitespace content. jsonb_extract_* keeps the historical jsonb_extract plus cast rewrite and follows normal cast rules.

Release note

Fix json_valid to reject incomplete or trailing JSON input.

Check List (For Author)

  • Test: Unit Test / FE build
    • ./run-be-ut.sh --run --filter=JsonBinaryValueTest.TestValidation:FunctionJsonbTEST.JsonValidStrictTest
    • ./build.sh --fe
    • Added regression-test/suites/jsonb_p0/test_json_valid_strict.groovy, but local regression run did not complete because the worktree FE metadata/port state prevented a stable cluster startup.
  • Behavior changed: Yes. json_valid rejects incomplete/trailing JSON input. typed jsonb_extract_* behavior is unchanged and documented as intentional.
  • Does this need documentation: No

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@mrhhsg
Copy link
Copy Markdown
Member Author

mrhhsg commented May 16, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review result: request changes.

Critical checkpoint conclusions:

  • Goal/test: The PR targets strict json_valid parsing and strict typed jsonb_extract_* semantics, with BE unit tests and a jsonb_p0 regression test. The added tests cover the core one-row cases but miss a multi-row constant-argument execution path.
  • Scope: The change is mostly focused, but the new typed numeric BE implementations need to preserve FunctionJsonbExtract's const-column contract.
  • Concurrency/lifecycle/config/compatibility: No new concurrency, special lifecycle, config, storage format, or FE-BE protocol compatibility concern was identified.
  • Parallel paths: FE now sends typed functions directly to BE; this correctly requires BE implementations for those functions, but the const JSON + const path path is incomplete for numeric typed extractors.
  • Error handling/data correctness: One data correctness issue found: scalar_vector sizes the output from the unwrapped const path column rather than the input block row count.
  • Tests: Existing tests do not exercise typed numeric extraction over multi-row input with both arguments constant.
  • Observability/performance: No additional observability requirement or obvious performance blocker found.

User focus: No additional user-provided review focus was present.

const ColumnString::Offsets& roffsets, const NullMap* r_null_map,
Container& res, NullMap& null_map) {
size_t size = roffsets.size();
res.resize(size);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sizes the result from the unwrapped path column, which is only one row when both the JSON argument and path argument are ColumnConst. FunctionJsonbExtract::execute_impl checks jsonb_data_const first and calls this scalar_vector branch even when path_const[0] is also true, so a query that evaluates jsonb_extract_int(jsonb_parse('{"a":1}'), '$.a') over a multi-row block can build a nested result column of size 1 and pair it with a null map sized to input_rows_count. That can either trip column-size invariants or return only the first row. Please handle the const/const case explicitly, or pass input_rows_count into this branch and fill that many result rows.

### What problem does this PR solve?

Issue Number: DORIS-25567

Related PR: None

Problem Summary: Fix json_valid accepting incomplete JSON input or JSON with trailing non-whitespace content. DORIS-25568 is documented as intentional behavior because typed jsonb_extract_* keeps the historical jsonb_extract plus cast rewrite and follows normal cast rules.

### Release note

Fix json_valid to reject incomplete or trailing JSON input.

### Check List (For Author)

- Test: Unit Test / FE build
    - ./run-be-ut.sh --run --filter=JsonBinaryValueTest.TestValidation:FunctionJsonbTEST.JsonValidStrictTest
    - ./build.sh --fe
    - Added regression-test/suites/jsonb_p0/test_json_valid_strict.groovy, but local regression run did not complete because the worktree FE metadata/port state prevented a stable cluster startup.
- Behavior changed: Yes. json_valid rejects incomplete/trailing JSON input. typed jsonb_extract_* behavior is unchanged and documented as intentional.
- Does this need documentation: No
@mrhhsg mrhhsg force-pushed the fix-json-strict-extract branch from ad33600 to fdf8418 Compare May 16, 2026 03:50
@mrhhsg mrhhsg changed the title [fix](json) Fix strict json validation and typed extraction [fix](json) Fix strict json validation May 16, 2026
@mrhhsg
Copy link
Copy Markdown
Member Author

mrhhsg commented May 16, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking performance regression in the JSONB parser change. The fix goal is clear and the tests cover the new strict json_valid cases, but the implementation adds a full-document raw_json pre-pass on every JSONB parse path, which doubles traversal work for large JSON payloads used by json_valid, jsonb_parse, casts, and ingestion.

Critical checkpoint conclusions:

  • Goal/test: the PR aims to make json_valid reject incomplete/trailing JSON; BE unit and regression coverage were added for representative invalid suffix cases.
  • Scope: the change is small, but the raw_json pre-pass is broader than needed for the goal.
  • Concurrency/lifecycle/config/compatibility: no new concurrency, lifecycle, config, storage-format, or FE-BE protocol compatibility concerns found.
  • Parallel paths: the shared JsonbParser path covers json_valid and JSONB parsing/casts, but that also means parser overhead affects all those paths.
  • Error handling/data correctness: Status propagation is preserved; the strict validation intent is sound.
  • Tests: added tests cover trailing non-whitespace content, but not the parser performance impact. No additional user-provided focus points were present.

Comment thread be/src/util/jsonb_parser_simd.h Outdated
}

std::string_view raw_json;
simdjson::error_code raw_res = doc.raw_json().get(raw_json);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raw_json() call consumes the entire document, then doc.rewind() makes the code parse and serialize the same JSON again. JsonbParser::parse is shared by json_valid, jsonb_parse, casts to JSONB, and other ingestion paths, so large object/array inputs now pay an extra full simdjson traversal just to detect trailing content. The later code already consumes the root value and checks doc.at_end(), while top-level scalar getters use simdjson's root APIs that disallow trailing content and the new is_null() check covers the not/partial-null case. Please avoid the whole-document raw_json() pre-pass and enforce the remaining scalar/null validation in the existing parse pass so JSONB parsing stays single-pass.

### What problem does this PR solve?

Issue Number: None

Related PR: apache#63309

Problem Summary: Avoid the extra raw_json pre-pass in strict JSONB parsing, and keep const JSON plus const path extract results sized to the input block.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - ./run-be-ut.sh --run --filter=JsonBinaryValueTest.TestValidation:FunctionJsonbTEST.JsonbParseTest:FunctionJsonbTEST.JsonbParseErrorToNullTest:FunctionJsonbTEST.JsonValidStrictTest:FunctionJsonbTEST.JsonExtractConstConstMultiRow
    - build-support/check-format.sh
    - build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN (failed due existing complexity/NOLINT diagnostics and local toolchain stddef.h lookup)
- Behavior changed: Yes. Const JSON plus const path non-string JSONB extract returns a correctly sized repeated result.
- Does this need documentation: No
@mrhhsg
Copy link
Copy Markdown
Member Author

mrhhsg commented May 16, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No additional blocking issues found in this review.

Critical checkpoint conclusions:

  • Goal and tests: the PR tightens JSON parsing so incomplete/trailing-content inputs are rejected, documents the existing typed JSON extract rewrite semantics, and fixes the const JSON plus const path result sizing path. Added BE unit coverage and a regression case target the changed behavior.
  • Scope: changes are small and focused on JSONB parsing/extraction and related tests/comments.
  • Concurrency/lifecycle: no new concurrency, locking, static initialization, or special lifecycle management introduced.
  • Compatibility/configuration/persistence: no new config, persistence, storage format, or FE-BE protocol changes. Behavior change is limited to stricter JSON text parsing.
  • Parallel paths: current final diff handles the previously raised const/const non-string JSON extract path; string/JSONB extract already uses the vector_vector_v2 path sized by input_rows_count.
  • Error handling: new parser failures return non-OK Status; no ignored new Status paths found in production code.
  • Test coverage: BE unit tests and regression coverage were added. I attempted ./run-be-ut.sh --run --filter=JsonBinaryValueTest.TestValidation:FunctionJsonbTEST.JsonbParseTest:FunctionJsonbTEST.JsonbParseErrorToNullTest:FunctionJsonbTEST.JsonValidStrictTest:FunctionJsonbTEST.JsonExtractConstConstMultiRow, but it timed out after 120s during first-time submodule setup before executing the tests.
  • Observability/performance: no new observability need; the final parser implementation avoids the earlier whole-document raw_json pre-pass performance concern.
  • User focus: no additional user-provided review focus was specified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants